Abstract

This paper presents the CMU submission to the 2008 TREC blog
distillation track. As in last year’s experiments, we evaluate
different retrieval models and apply a query expansion method that
leverages the link structure in Wikipedia. We also explore a corpus
that combines several different representations of the documents,
using both the feed XML and the permalink HTML, and report initial
experiments with spam filtering.


1  Introduction 

The CMU submission to the 2008 blog distillation track explored
document representation, retrieval models, query expansion, and spam
filtering. CMU’s retrieval system, based on the Indri search engine
(footnote 1), used a combined index of the permalink and feed
documents, differentially weighting text from various parts of the
HTML and XML. Two retrieval models were applied: the large document
model, where each feed is viewed as a single document; and the small
document model, where a feed is represented as a collection of
individual entry documents. As in last year’s submission, our query
expansion method leverages the link structure in Wikipedia. A spam
filtering component was also integrated.

2  Document  Representation 

Although our system last year successfully made use of only the feed
(XML) documents, subsequent testing indicated that using the
permalink (HTML) documents could provide some performance
improvements. In order to leverage both representations of the blog
feeds, this year’s submission used a combined index in which each
indexed blog contains the text from both the permalink and feed
documents. These separate representations of the blogs are indexed as
fields in Indri, allowing flexible access to the different document
representations at query time.

1  http://www.lemurproject.org/indri

The fields represented in our index, given in Table 1, include the
full text of the HTML pages (permtext), as well as several structural
elements of the feed XML documents.


Field Name    Description (Source)
----------    ---------------------
permtext      Permalink text (HTML)
title         Feed title (XML)
entrytitle    Entry title (XML)
entrybody     Entry content (XML)

Table 1: Indexed fields


3  Retrieval  Models 


We applied two retrieval models to the task of blog distillation: the
large document model, which treats each blog or feed as a single
(large) document, and the small document model, which retrieves blog
entries individually and aggregates an entry ranking into an overall
feed ranking.


3.1  Large  Document  Model 

The large document model treats each feed as a concatenation of all
its entries. In this model, documents are ranked by the posterior
probability of observing the feed given the query,

$$P_{LD}(F|Q) = \frac{P_{LD}(Q,F)}{P(Q)} \stackrel{rank}{=} \underbrace{P(F)}_{\text{Document Prior}} \; \underbrace{P_{LD}(Q|F)}_{\text{Query Likelihood}} \tag{1}$$

The query likelihood component is estimated as a weighted combination
of query likelihoods from the different document representations
(permalink text, feed title, entry titles and entry content),

$$P_{LD}(Q|F) = \prod_j P_{LD}(Q|F^{(j)})^{v_j}, \tag{2}$$

where $j$ ranges over the different representations and the $v_j$ are
learned weights for each of those representations. This
representation-specific query likelihood component is estimated with
Metzler & Croft’s full dependence model [6, 2] using
Dirichlet-smoothed maximum likelihood estimates [7],


$$P_{LD}(Q|F^{(j)}) = \prod_{\psi_i \in \Psi(Q)} P_D(\psi_i|F^{(j)})^{w_i} = \prod_{\psi_i \in \Psi(Q)} \left( \frac{tf_{\psi_i;F^{(j)}} + \mu P_{MLE}(\psi_i|C)}{|F^{(j)}| + \mu} \right)^{w_i}. \tag{3}$$

The $\psi_i$ are query unigram and term window query features, and the
dependence model weights $w_i$ are taken from previous literature.
This complex query formulation can be expressed in the Indri query
template shown in Table 2, where <unigrams> is a simple unigram query,
<unordered windows> is a group of #uw query operators, each with a
window size set to twice the number of query terms considered, and
<ordered windows> is a group of #1 query operators. Parameters v1-v4
were trained on last year’s queries and will be discussed in
Section 3.3.

    #weight(
      v1 #weight(
        0.8 #combine(<unigrams>.(permtext))
        0.1 #combine(<ordered windows>.(permtext))
        0.1 #combine(<unordered windows>.(permtext)) )
      v2 #weight(
        0.8 #combine(<unigrams>.(title))
        0.1 #combine(<ordered windows>.(title))
        0.1 #combine(<unordered windows>.(title)) )
      v3 #weight(
        0.8 #combine(<unigrams>.(entrytitle))
        0.1 #combine(<ordered windows>.(entrytitle))
        0.1 #combine(<unordered windows>.(entrytitle)) )
      v4 #weight(
        0.8 #combine(<unigrams>.(entrybody))
        0.1 #combine(<ordered windows>.(entrybody))
        0.1 #combine(<unordered windows>.(entrybody)) ) )

Table 2: Large Document Indri Query Template


3.2  Small  Document  Model

The small document model scores each entry individually, and then
combines those scores into an overall feed score. Again, ranking by
the posterior probability of observing the feed given the query, we
have,

$$P_{SD}(F|Q) = \frac{1}{P(Q)} \sum_{E \in F} P_{SD}(Q,E,F) \stackrel{rank}{=} P(F) \sum_{E \in F} P(Q|E,F) \, P(E|F) = \underbrace{P(F)}_{\text{Feed Prior}} \sum_{E \in F} \underbrace{P_{SD}(Q|E)}_{\text{Query Likelihood}} \; \underbrace{P(E|F)}_{\text{Entry Centrality}}, \tag{4}$$

where the last line holds if we assume queries are conditionally
independent of feeds given the entry. As in the large document model,
Equation 2, the query likelihood is calculated via a combination of
different document representations

$$P_{SD}(Q|E) = \prod_j P_{SD}(Q|E^{(j)})^{v_j}, \tag{5}$$

and these query likelihood components are estimated with a full
dependence model using Jelinek-Mercer smoothing [6, 7] to combine both
the entry and feed language models,

$$P_{SD}(Q|E^{(j)}) = \prod_{\psi_i \in \Psi(Q)} P_{JM}(\psi_i|E^{(j)})^{w_i} = \prod_{\psi_i \in \Psi(Q)} \left( \lambda_E P_{MLE}(\psi_i|E^{(j)}) + \lambda_F P_{MLE}(\psi_i|F^{(j)}) + \lambda_C P_{MLE}(\psi_i|C) \right)^{w_i}. \tag{6}$$

The centrality component of this model is given by,

$$P(E|F) = \frac{\phi(E,F)}{\sum_{E_i \in F} \phi(E_i,F)}, \tag{7}$$


where $\phi$ is defined as,

$$\phi(E,F) = \prod_{t_i \in E} P(t_i|F)^{P(t_i|E)}. \tag{8}$$

This centrality scoring favors entries whose language more closely
matches the language of the feed as a whole. In practice, the product
in Equation 8 is only taken over the query terms, providing a
query-conditioned centrality measure. The feed prior component of this
model is used to correct for the overly optimistic centrality scoring,
and is proportional to the log of the feed length:
$P(F) \propto \log(|F|)$. See [2] for a more thorough description of
this retrieval model.
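
The small document scoring described above can be sketched in a few lines of Python. This is a minimal illustration rather than the submitted system: it uses unigram features only (no term windows), a single document representation, treats the feed length |F| as the number of entries, and takes the centrality function to be P(t|F)^P(t|E) over the query terms. All of these choices, along with the names `p_mle` and `sd_score`, are assumptions made for the example.

```python
import math
from collections import Counter


def p_mle(term, tokens):
    """Maximum-likelihood estimate P_MLE(term | tokens)."""
    return Counter(tokens)[term] / len(tokens) if tokens else 0.0


def sd_score(query, feed, collection, lam_e=0.4, lam_f=0.3, lam_c=0.3):
    """Unigram-only small document score for one non-empty feed.

    query: list of terms; feed: list of entries (each a token list);
    collection: token list standing in for the whole corpus.
    """
    feed_tokens = [t for entry in feed for t in entry]

    def likelihood(entry):
        # Jelinek-Mercer mixture of entry, feed and collection models.
        p = 1.0
        for t in query:
            p *= (lam_e * p_mle(t, entry)
                  + lam_f * p_mle(t, feed_tokens)
                  + lam_c * p_mle(t, collection))
        return p

    def phi(entry):
        # Query-conditioned centrality (assumed form of the phi function).
        value = 1.0
        for t in query:
            value *= p_mle(t, feed_tokens) ** p_mle(t, entry)
        return value

    z = sum(phi(e) for e in feed)   # normaliser of the centrality
    prior = math.log(len(feed))     # assumption: |F| = number of entries
    return prior * sum(likelihood(e) * phi(e) / z for e in feed)


# Toy data, purely illustrative.
feed_a = [["road", "cycling", "race"], ["cycling", "team", "news"]]
feed_b = [["cooking", "pasta", "recipe"], ["road", "trip", "photos"]]
collection = [t for feed in (feed_a, feed_b) for entry in feed for t in entry]
query = ["road", "cycling"]
score_a = sd_score(query, feed_a, collection)
score_b = sd_score(query, feed_b, collection)
```

On this toy data the cycling feed outscores the cooking feed. Note that under the entry-count assumption a feed with a single entry receives a prior of log(1) = 0.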

3.3  Parameter  Estimation

The above models have several free parameters for smoothing and model
combination that must be set. All of these parameters were set via a
simple grid search with a step size of 0.1 using last year’s queries
and relevance judgements [5]. The weight settings used in Equations 2
and 5 are shown in Table 3.

Model   permtext   title   entrytitle   entrybody
        (v1)       (v2)    (v3)         (v4)
LD      0.3        0.5     0.1          0.1
SD      0.6        -       0.1          0.3

Table 3: Weight settings ($v_j$) for different document
representations.
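
A grid search of this kind can be sketched as follows. The sketch assumes that the representation weights are constrained to sum to one (as the values in Table 3 do) and that some function `evaluate` returns the MAP obtained with a given weight vector on the training queries. Both `evaluate` and the helper names are hypothetical and stand in for a full retrieval and evaluation run.

```python
from itertools import product


def simplex_grid(n_weights, steps=10):
    """Yield every weight vector on a 1/steps grid whose entries sum to 1."""
    for combo in product(range(steps + 1), repeat=n_weights - 1):
        rest = steps - sum(combo)
        if rest >= 0:
            yield tuple(c / steps for c in combo) + (rest / steps,)


def grid_search(evaluate, n_weights, steps=10):
    """Return the grid point with the highest value of evaluate(weights)."""
    return max(simplex_grid(n_weights, steps), key=evaluate)


# Stand-in objective for illustration only: peaks at the LD row of Table 3.
def toy_objective(weights, target=(0.3, 0.5, 0.1, 0.1)):
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


best = grid_search(toy_objective, n_weights=4)
```

With four weights and a step size of 0.1 the grid contains 286 points, so an exhaustive search is cheap relative to the cost of each retrieval run.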

The smoothing parameters used were $\mu = 2500$ for the large document
model and $\lambda_E = 0.4$, $\lambda_F = 0.3$ and $\lambda_C = 0.3$
for the small document model.
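
To make the large document settings concrete, the following sketch scores a feed with Dirichlet-smoothed unigram likelihoods per field and combines the fields log-linearly with the LD weights from Table 3. It is a simplification under stated assumptions: only unigram features are used (the ordered and unordered window features are omitted), each field is smoothed against a collection model for the same field, and a small floor avoids taking the log of zero. The function and variable names are illustrative, not taken from the system.

```python
import math
from collections import Counter

MU = 2500
LD_WEIGHTS = {"permtext": 0.3, "title": 0.5, "entrytitle": 0.1, "entrybody": 0.1}


def dirichlet_log_likelihood(query, field_tokens, coll_tokens, mu=MU):
    """Sum of log Dirichlet-smoothed unigram probabilities for the query."""
    tf = Counter(field_tokens)
    cf = Counter(coll_tokens)
    total = 0.0
    for term in query:
        p_coll = cf[term] / len(coll_tokens)
        p = (tf[term] + mu * p_coll) / (len(field_tokens) + mu)
        total += math.log(max(p, 1e-12))  # floor is an assumption of this sketch
    return total


def ld_log_score(query, feed_fields, coll_fields, weights=LD_WEIGHTS):
    """Weighted sum over fields, i.e. the log of the product of powers."""
    return sum(
        w * dirichlet_log_likelihood(query, feed_fields[f], coll_fields[f])
        for f, w in weights.items()
    )


# Toy feeds, purely illustrative.
feed1 = {"permtext": ["road", "cycling", "blog"], "title": ["cycling", "news"],
         "entrytitle": ["race"], "entrybody": ["cycling", "race", "report"]}
feed2 = {"permtext": ["cooking", "blog", "pasta"], "title": ["food", "news"],
         "entrytitle": ["pasta"], "entrybody": ["recipe", "pasta", "sauce"]}
coll = {f: feed1[f] + feed2[f] for f in LD_WEIGHTS}
s1 = ld_log_score(["cycling"], feed1, coll)
s2 = ld_log_score(["cycling"], feed2, coll)
```

Working in log space turns the product of weighted likelihoods into a weighted sum, which is the same combination that the #weight operator expresses in the Indri template.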

4  Wikipedia  Link-based  Expansion

As demonstrated in last year’s TREC submission, blog retrieval can
greatly benefit from specifically designed query expansion methods. In
this year’s submission, we apply the same query expansion method,
which uses the link structure in Wikipedia to discover topics related
to the query.

The original Wikipedia markup includes cross-article links. Each link
is specified by its target Wikipedia article and its anchor text or
anchor phrase, which may differ from the target article’s title. Our
Wikipedia link-based expansion method expands the query with related
anchor phrases from Wikipedia, scoring each proportional to how often
it occurs in links to documents relevant to the query.

The unexpanded query is issued as a dependence model query to our
Wikipedia index, which comprises 2,471,311 articles from the English
Wikipedia, excluding date and category pages. From the resulting
ranking, two document sets are defined: the top R documents are
defined as the relevant set, $S_R$, and the top W documents are
defined as the working set, $S_W$. Because R < W, it follows that
$S_R \subset S_W$. The method focuses on anchor phrases appearing in
articles in $S_W$ that link to an article in $S_R$. Anchor phrase
$a_i$ is scored according to,

$$\mathrm{score}(a_i) = \sum_{a_{ij} \in S_W} I\left[\mathrm{target}(a_{ij}) \in S_R\right] \times \left(R - \mathrm{rank}(\mathrm{target}(a_{ij}))\right), \tag{9}$$

where $a_{ij}$ denotes an occurrence of anchor phrase $a_i$,
$\mathrm{target}(a_{ij})$ is the article linked to by anchor phrase
occurrence $a_{ij}$, $\mathrm{rank}(\mathrm{target}(a_{ij}))$ denotes
the rank of $\mathrm{target}(a_{ij})$, and $I[\cdot]$ is the indicator
function. The unexpanded query was augmented with the 20
highest-scoring expansion phrases.
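
The anchor phrase scoring can be illustrated with a short Python sketch. It assumes that the ranking is a list of article identifiers with the best article first, that ranks are 1-based, and that outgoing links are available as a mapping from a source article to (anchor phrase, target article) pairs. These data structures and the function name are assumptions made for the example, and the raw sums are returned without any normalisation.

```python
from collections import defaultdict


def score_anchor_phrases(ranking, links, R, W, top_k=20):
    """Score anchor phrases found in the top-W articles that link into the top R.

    ranking: article ids, best first.
    links: dict mapping source article id -> list of (phrase, target id).
    """
    rank = {article: r for r, article in enumerate(ranking[:R], start=1)}
    scores = defaultdict(float)
    for source in ranking[:W]:                  # working set
        for phrase, target in links.get(source, []):
            if target in rank:                  # target is in the relevant set
                scores[phrase] += R - rank[target]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy link graph, purely illustrative.
ranking = ["A", "B", "C", "D"]
links = {
    "A": [("b phrase", "B")],
    "C": [("a phrase", "A"), ("b phrase", "B")],
    "D": [("a phrase", "A"), ("c phrase", "C")],
}
expansion = score_anchor_phrases(ranking, links, R=3, W=4)
```

Each occurrence contributes more when its target is ranked higher, and a phrase accumulates score over all of its occurrences, so both frequency and target rank influence the final ordering.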

The method aims to fulfill two desired properties of expansion
phrases: that they relate to the query and that they are popular
terms, likely to appear in other documents. If two candidate expansion
phrases appear an equal number of times in $S_W$, the one appearing in
links to the more highly ranked documents will score higher. If two
candidate expansion phrases appear in links to the same document, the
more frequent one will score higher. Prior work shows that this method
yields higher retrieval performance than pseudo-relevance feedback
(PRF) on the same external resource, Wikipedia [2]. An evaluation of
the sensitivity of this query expansion method to the parameters R and
W is presented in previous work [2, 1], and the parameter settings
used here are taken from that work.

5  Splog  Detection 

Our method for splog detection combined four distinct classifiers.
Three were rule-based classifiers built on potentially
high-precision, low-coverage heuristics. The fourth was a
maximum-margin classifier based on bag-of-words features.

5.1  Post  Time  Interval 

Some splogs are machine-generated, containing nonsense text or even
snippets of text woven together from other blogs or websites [4].
Machine-generated splogs may be characterized by an unusually
consistent time interval between consecutive posts. Our post
time-interval classifier (TI) classifies a blog as spam if its time
interval between consecutive posts varies by no more than 10 seconds.
This classifier labeled 366 blogs as spam, of which 333 (91%) were
not labeled as spam by any other classifier.
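
A minimal sketch of such a rule is shown below. It interprets "varies by no more than 10 seconds" as the difference between the longest and shortest gap between consecutive posts, and it declines to flag blogs with fewer than three posts. Both choices are assumptions of this example, as is the function name.

```python
def is_time_interval_spam(post_times, tolerance=10):
    """Flag a blog whose inter-post gaps are almost perfectly regular.

    post_times: post timestamps in seconds (any order).
    """
    times = sorted(post_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) < 2:
        return False  # assumption: too few posts to judge regularity
    return max(gaps) - min(gaps) <= tolerance


# Posts roughly one hour apart, within a few seconds of each other.
regular = is_time_interval_spam([0, 3600, 7205, 10800])
# Posts at irregular, human-looking intervals.
irregular = is_time_interval_spam([0, 3600, 9000, 9100])
```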

5.2  Term  Compression 

Many splogs exist to promote and advertise affiliated sites [3]. These
splogs may be characterized by an unusually small ratio of vocabulary
size to term count. Our term compression classifier (TC) classifies a
blog as spam if X < 6.5% of its unique terms account for Y > 50% of
its total term count. We set Y = 50% and tuned X to maximize MAP on
last year’s queries and relevance judgements [5]. This classifier
labeled 4523 blogs as spam, of which 1687 (41%) were not labeled as
spam by any other classifier.

5.3  Link  Compression 

Some splogs exist to artificially inflate the PageRank of affiliated
sites [3]. These splogs may be characterized by an unusually high
percentage of hyperlinks linking to the same URL(s). Our link
compression classifier (LC) classifies a blog as spam if X < 1% of its
unique link-to URLs account for Y > 70% of its total hyperlink count.
We set Y = 70% and tuned X to maximize MAP on last year’s queries and
relevance judgements [5]. This classifier labeled 640 blogs as spam,
of which 602 (94%) were not labeled as spam by any other classifier.
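
The term and link compression rules share the same structure, which the following sketch makes explicit. It computes the smallest fraction of distinct items (terms or URLs) needed to cover a share Y of all occurrences and compares that fraction with X. Treating both thresholds as inclusive is an assumption of this example, and the helper names are illustrative.

```python
from collections import Counter


def smallest_covering_fraction(items, coverage):
    """Fraction of distinct items needed to cover `coverage` of all occurrences."""
    counts = sorted(Counter(items).values(), reverse=True)
    total = sum(counts)
    running = 0
    for k, count in enumerate(counts, start=1):
        running += count
        if running >= coverage * total:
            return k / len(counts)
    return 1.0


def is_term_compression_spam(terms, x=0.065, y=0.50):
    return len(terms) > 0 and smallest_covering_fraction(terms, y) <= x


def is_link_compression_spam(urls, x=0.01, y=0.70):
    return len(urls) > 0 and smallest_covering_fraction(urls, y) <= x


# One term repeated 60 times among 40 other distinct terms.
spammy_terms = ["buy"] * 60 + [f"w{i}" for i in range(40)]
# 100 distinct terms, each used once.
normal_terms = [f"w{i}" for i in range(100)]
# One URL repeated 300 times among 100 other distinct URLs.
spammy_urls = ["http://spam.example/"] * 300 + [
    f"http://site{i}.example/" for i in range(100)
]
```

In the first example a single term out of 41 covers 60% of the text, so the covering fraction is about 2.4% and the blog is flagged. In the uniform example half of the vocabulary is needed, and the blog is not flagged.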

5.4  SVM  classifier 

The lexical features of a blog may provide evidence that it is spam
(e.g., pornographic content). An SVM (footnote 2) bag-of-words model
was trained using a publicly available ham/spam blog data set
(footnote 3). Terms were weighted by $P_{MLE}(w|F)$. This classifier
labeled 10839 blogs as spam, of which 8365 (77%) were not labeled as
spam by any other classifier.

5.5  Combining  Classifiers 

These four spam classifiers were combined log-linearly and integrated
into the large document model in the form of query-independent
document priors (Formula 1) as,

$$\log P(F) \propto \log P(ham|F) = \sum_i f_i \log \alpha_i \tag{10}$$

where $f_i$ denotes the binary output of each of the above classifiers
and each $\alpha_i$ denotes the weight associated with that
classifier. We set $\log \alpha_{TI} = -100$ manually, based on our
hypothesis that its precision is high. Holding this value constant,
$\log \alpha_{TC} = -3$, $\log \alpha_{LC} = -1$, and
$\log \alpha_{SVM} = -1$ were set by doing a grid search to maximize
MAP on last year’s queries and relevance judgements [5].
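
The log-linear combination amounts to adding a fixed penalty for every classifier that fires. A minimal sketch, using the log weights reported above and a hypothetical dictionary of binary classifier outputs:

```python
LOG_ALPHA = {"TI": -100.0, "TC": -3.0, "LC": -1.0, "SVM": -1.0}


def log_ham_prior(flags, log_alpha=LOG_ALPHA):
    """Unnormalised log prior: sum of log weights of the classifiers that fired.

    flags: dict mapping classifier name -> True if it labeled the blog as spam.
    """
    return sum(log_alpha[name] for name, fired in flags.items() if fired)


clean = log_ham_prior({"TI": False, "TC": False, "LC": False, "SVM": False})
suspect = log_ham_prior({"TI": False, "TC": True, "LC": False, "SVM": True})
regular_poster = log_ham_prior({"TI": True, "TC": False, "LC": False, "SVM": False})
```

Because the value is added to the log query likelihood, a weight of -100 for the TI classifier effectively removes a flagged blog from the ranking, while the smaller weights only demote it.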

6  Results 

Our 4 submitted runs (cmuSD = SD model, cmuSDwiki = SD + Wiki
expansion, cmuLDwiki = LD + Wiki expansion, and cmuLDwikiSP = LD +
Wiki expansion + splog detection) used only the TREC topic title
field. Results are given in Table 4.

2  http://svmlight.joachims.org/

3  http://ebiquity.umbc.edu/resource/html/id/212/Splog-Blog-Dataset


run           MAP     P@10    R-Prec
cmuSD         0.246   0.372   0.3086
cmuSDwiki     0.259   0.372   0.3178
cmuLDwiki     0.302   0.422   0.3534
cmuLDwikiSP   0.306   0.434   0.3646

Table 4: Results
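
For readers unfamiliar with the three measures in Table 4, the sketch below computes them for a single query under the usual definitions. MAP is the mean of the per-query average precision. The function names are illustrative, and this is not the evaluation software used for the official results.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant items."""
    r = len(relevant)
    return precision_at_k(ranking, relevant, r) if r else 0.0


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each retrieved relevant item."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0


# Toy ranking: "a" and "c" are retrieved, relevant feed "e" is missed.
ranking = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
ap = average_precision(ranking, relevant)
```

Here the relevant feeds appear at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 3, since the missed feed contributes zero.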


Figures 1(a) and 1(b) show cmuLDwikiSP’s per-query performance in
terms of average precision (AP) and R-Precision (R-Prec),
respectively, alongside the per-query median and best performance.
Queries are sorted along the x axis in descending order of cmuLDwikiSP
performance. Dots indicate the queries for which cmuLDwikiSP obtained
the best performance.

7  Wikipedia  Link-based  Expansion  Error  Analysis

Here, we focus on the queries that were helped or hindered the most by
Wikipedia link-based expansion. Comparing cmuSD vs. cmuSDwiki,
Wikipedia link-based expansion improved MAP for 33/50 (66%) queries.
The ones with the largest MAP increase were “road cycling” (372%),
“U.S. national park” (266%) and “theater” (138%). Wikipedia link-based
expansion found valuable expansion terms for these queries because
most documents in the relevant set, $S_R$, were in fact relevant.
These were queries with many relevant articles in Wikipedia (more than
R = 100) in the form of relevant named entities (e.g., names of
cyclists and cycling events, and names of parks and organizations
dealing with parks). The top 10 expansion terms and scores for “road
cycling” were,


cycling                              0.135
lance armstrong                      0.107
uci                                  0.078
discovery channel pro cycling team   0.072
road bicycle racing                  0.071
uci protour                          0.061
paolo bettini                        0.054
discovery channel                    0.051
union cycliste internationale        0.050
george hincapie                      0.040

(33%) queries saw a decrease in MAP. One interesting observation is
that non-relevant documents in $S_R$ can lead to non-relevant
expansion terms even when ranked below relevant documents. Wikipedia
link-based expansion scores anchor phrases in proportion to their
frequency and the rank of the article they link to. Therefore, it is
possible for an anchor phrase used in links to a mid-ranked article to
outscore one used in links to a top-ranked article, if the first
occurs more frequently than the second. In other words, scores are
biased towards anchor phrases linking to (possibly non-relevant)
articles with many in-links.

Figure 1: Per-query results: median, best, and cmuLDwikiSP.
(a) Per-query AP; (b) Per-query R-Prec.

The query with the largest drop in MAP (70%) was “food in Singapore”.
The top 5 expansion terms and scores were,

singapore                          0.596
lee kuan yew                       0.053
orchard road                       0.039
national university of singapore   0.033
singapore airlines                 0.030

These non-relevant expansion phrases originated from links to
non-relevant documents. Interestingly, the unexpanded query
successfully ranked the article “Cuisine in Singapore” above those
linked to by these expansion phrases. However, “Cuisine in Singapore”
had a total of 21 in-links, whereas each of the articles associated
with these non-relevant anchor phrases had more than 100 in-links. The
article “Singapore” had about 8000. The high number of in-links
associated with these non-relevant articles in $S_R$ produced these
poor expansion phrases.

This same error type caused the third largest drop in MAP (30%), for
the query “3D cities globes”. The top 5 expansion terms and scores
were,

golden globe award   0.228
golden globes        0.150
duke nukem 3d        0.084
globe                0.082
golden globe         0.066

Here, again, the top ranked documents in $S_R$ were all relevant
(“Live Search Maps”, “Virtual Globe”, “Google Earth”, and “Polygonal
Modeling”). However, all had fewer than 10 in-links. The non-relevant
article “golden globe award”, which contributed three non-relevant
expansion phrases, was ranked 15th and had 739 in-links. The
non-relevant article “Duke Nukem 3D”, which contributed the
third-ranked anchor phrase, was ranked 58th and had 133 in-links.

This analysis shows that non-relevant articles in $S_R$, even if not
at the top of the ranking, are especially damaging when they have many
in-links. These potentially damaging non-relevant articles are more
likely to be introduced into $S_R$ when there are fewer than R
relevant articles for the query. This analysis is compatible with
prior results showing that Wikipedia link-based expansion is
particularly successful when the query describes a broad, general
topic, likely to have many (> R) relevant articles in Wikipedia [2].

8  Conclusion 

This year we continued our evaluation of retrieval models for blog
feed search, applying extensions to the previous year’s models. We
also experimented with query expansion for this task, as well as an
expanded document representation and spam filtering. The best
performing retrieval model from last year’s submission continued to
perform well this year, but the extensions to last year’s small
document model did not. Query expansion also showed promising
improvements on this year’s query set, and spam filtering provided a
slight performance boost.